OpenVINO EP Weights Sharing Feature #23553
Conversation
@jywu-msft @adrianlizarraga @HectorSVC Kindly review & merge.
/azp run Linux OpenVINO CI Pipeline
Azure Pipelines successfully started running 1 pipeline(s).
/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, Linux QNN CI Pipeline
Azure Pipelines successfully started running 8 pipeline(s).
Could you update the PR title to better describe the changes you've made?
Changed the title as requested.
/azp run Big Models,Linux Android Emulator QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows GPU TensorRT CI Pipeline,Windows x64 QNN CI Pipeline
Pull request contains merge conflicts.
Could you resolve the merge conflict? |
Force-pushed from d8857de to 6371811
Fixed the conflicts. Kindly review & merge.
/azp run Big Models,Linux Android Emulator QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows GPU TensorRT CI Pipeline,Windows x64 QNN CI Pipeline
/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, Linux QNN CI Pipeline
/azp run Linux OpenVINO CI Pipeline
Azure Pipelines successfully started running 10 pipeline(s).
Azure Pipelines successfully started running 1 pipeline(s).
Azure Pipelines successfully started running 8 pipeline(s).
It seems there is no unit test for this feature for the OpenVINO EP. Could you please add a unit test for it?
Yes. It would be great to have something demonstrating how this feature gets used, from model generation to model inference.
Agreed, the OVEP unit tests are, well, absent. We have started working on adding OVEP unit tests, but those will come later and will not be part of this PR.
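To make the reviewers' point concrete, here is a minimal sketch of the generation half of that flow as I understand it from this PR: compiling each model once so an EPContext model (blob) is produced that can later be loaded with shared weights. This is not a unit test from the repository; the session config keys (`ep.context_enable`, `ep.context_file_path`, `ep.share_ep_contexts`), the NPU `device_type`, and the file names are assumptions based on the options discussed in this PR.

```python
import onnxruntime as ort

def dump_epctx_model(model_path, ctx_path):
    so = ort.SessionOptions()
    # Ask ORT to emit an EPContext (pre-compiled) model for this graph.
    so.add_session_config_entry("ep.context_enable", "1")
    so.add_session_config_entry("ep.context_file_path", ctx_path)
    # Opt in to sharing the EP context/weights across sessions in this process.
    so.add_session_config_entry("ep.share_ep_contexts", "1")
    ort.InferenceSession(
        model_path,
        sess_options=so,
        providers=["OpenVINOExecutionProvider"],
        provider_options=[{"device_type": "NPU"}],
    )

# Hypothetical prefill/decode pair of an LLM exported as two ONNX models.
dump_epctx_model("prefill.onnx", "prefill_ctx.onnx")
dump_epctx_model("kvcache.onnx", "kvcache_ctx.onnx")
```

A corresponding inference-side sketch appears after the PR description below.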
Force-pushed from bf4dc5b to 568a64d
* Rename EP instance context as session_context
* Add support for GetEpContextNodes
* enable config option for ovep weight sharing
* add config option for ovep weight sharing
* Refactor the conditional blocks in OVEP for compilation
* Convert initializers with external data to graph inputs
* create, store and export metadata for ovep weight sharing
* fix error handling in weight sharing
* fix crash issue while setting up inputs for wai model
* pass weight sharing option to OVEP qdq stripping pass
* Aligning OVEP variable names to match the session option value they hold
* Add plumbing for context sharing plus refactoring around option handling
* Store metadata in shared context
* fix: fix provider options
* create ov tensor from meta data and external data
* create ov tensor
* Add support for binding weight as input tensors
* Fix for mapping subgraph to ov compiled network arguments
* Fix for using so_share_ep_contexts without ep.context* flags
* Add remote tensor support for NPU weight sharing
* Use a single ov::Core copy across OVEP
* Decouple provider option cache_dir from session option ep.context_file_path
* Add support for serialization and deserialization of metadata to disk
* Load blobs from relative path stored in ep_cache_context
* Use remote L0 tensors for shared weights
* fix linux ci issues
* fix ci issues
* Fix Windows build failure
* Use ifstream to load weights instead of mmaped file
* Fix for epctx models made up entirely of OVEP epctx nodes
* Limit ov::Core lifetime to that of provider object
* Enforce shared tensors cleanup on shutdown
* Add support for default device type based on project configuration
* fix: Fixed concrete_backend_ pointer double free issue on Linux
* Preetha/weight sharing fix (#545)
* Move variables from subgraph to session context for model specific properties
* Fix for redundant subgraph creation
* Remove unused variable
---------
Co-authored-by: Javier E. Martinez <[email protected]>
Co-authored-by: saurabhkale117 <[email protected]>
Co-authored-by: Preetha Veeramalai <[email protected]>
Co-authored-by: ankitm3k <[email protected]>
Co-authored-by: Eric Crawford <[email protected]>
Co-authored-by: saurabh <[email protected]>
* Fix blob generation with AUTO:GPU,CPU
* Remove unused variable
* Use ep.context_file_path to get base path when creating session from memory
* Fixed lint issues
---------
Co-authored-by: Javier E. Martinez <[email protected]>
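One of the commits above, "Convert initializers with external data to graph inputs", is easiest to picture at the ONNX graph level. The following is a conceptual Python sketch of that re-labelling, not the EP's actual C++ implementation; the model file names are hypothetical.

```python
import onnx
from onnx import helper, TensorProto

# Load only the graph structure; leave external weight payloads on disk.
model = onnx.load("prefill.onnx", load_external_data=False)
graph = model.graph

# Find constant initializers whose data lives in an external file.
external = [t for t in graph.initializer
            if t.data_location == TensorProto.EXTERNAL]

for tensor in external:
    graph.initializer.remove(tensor)
    # Re-label the weight as a graph input so its data can be bound at runtime
    # (e.g. from a shared copy) instead of being embedded in each blob.
    graph.input.append(
        helper.make_tensor_value_info(tensor.name, tensor.data_type,
                                      list(tensor.dims)))

onnx.save(model, "prefill_weights_as_inputs.onnx")
```

The idea, per the description below, is that once the weights are inputs rather than embedded initializers, the same external data can back both the prefill and kvcache compiled models.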
/azp run Linux OpenVINO CI Pipeline
Azure Pipelines successfully started running 1 pipeline(s).
/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, Linux QNN CI Pipeline
/azp run Big Models,Linux Android Emulator QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows GPU TensorRT CI Pipeline,Windows x64 QNN CI Pipeline
Azure Pipelines successfully started running 8 pipeline(s).
Azure Pipelines successfully started running 10 pipeline(s).
It was updated according to the comments.
### Description
These changes ensure that weight sharing happens between two models using the session context option ep_weight_sharing. Key changes introduced in this feature are:
- Creating a shared context between two models
- Extracting external constant initializers and re-labelling them as inputs to the model, so that the weights can be loaded directly from the blob
- Creating EP Context nodes when subgraph partitioning is happening
### Motivation and Context
This change was required to ensure that an LLM with prefill and kvcache models can use the same shared weights. The change was also required to ensure that EP Context nodes can be formed even when the model is being subgraph partitioned.
---------
Co-authored-by: jatinwadhwa921 <[email protected]>
Co-authored-by: jatinwadhwa921 <[email protected]>
Co-authored-by: saurabh <[email protected]>
Co-authored-by: TejalKhade28 <[email protected]>
Co-authored-by: sfatimar <[email protected]>
Co-authored-by: Javier E. Martinez <[email protected]>
Co-authored-by: Preetha Veeramalai <[email protected]>
Co-authored-by: Eric Crawford <[email protected]>
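As a rough usage illustration of the description above: assuming the feature is driven through the `ep.share_ep_contexts` session config entry (the PR also refers to it as ep_weight_sharing / so_share_ep_contexts), two sessions created in the same process could share the weights referenced by their EPContext models. The file names and the NPU `device_type` are placeholders, not values taken from this PR.

```python
import onnxruntime as ort

def load_with_shared_weights(ctx_model_path):
    so = ort.SessionOptions()
    # Both sessions opt in to sharing; the second session reuses the weights
    # already materialized by the first instead of loading its own copy.
    so.add_session_config_entry("ep.share_ep_contexts", "1")
    return ort.InferenceSession(
        ctx_model_path,
        sess_options=so,
        providers=["OpenVINOExecutionProvider"],
        provider_options=[{"device_type": "NPU"}],
    )

prefill_session = load_with_shared_weights("prefill_ctx.onnx")
kvcache_session = load_with_shared_weights("kvcache_ctx.onnx")
```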
### Description
This PR updates the win-ort-main branch to the tip of the main branch as of 2025-02-11.
### PR List
74c778e [WebNN EP] Automatically move input CPU tensors to ml-tensor (#23073)
3775057 use correct total length to fix static kv_cache performance (#23615)
3901e96 remove --use_vcpkg flag for Python-CUDA-Packaging-Pipeline (#23631)
c610df5 Add python_requires to package metadata (#23604)
2d27d68 [QNN EP] Add QNN EP to ARM64X build targets (#23635)
e666503 [webgpu] no longer need pass-in gpu adapter for custom context (#23593)
af679a0 Fix logic for selecting alternate name for blob (#23617)
e206950 [ARM CPU] Add fp16 mlas kernels for exp, tanh, softmax, logsoftmax, softcap (#23597)
9ba5619 Update pybind and json to the latest (#23589)
c54736c Migrate iOS release pipeline to 1 ES (#23606)
3981326 Increase timeout for Windows TensorRT CI (#23625)
0274b7b fix on trtCudaVersion (#23616)
740e9ab update run CI script (#23621)
5ef1832 [WebGPU] Support PIX Capture for WebGPU EP (#23192)
0114551 Fix for C4267 warning (#23610)
002916a Validate the context_file_path before EP compile graphs (#23611)
0887e36 [webgpu] Use pushErrorScope()/popErrorScope() once for an inference run (#23438)
65008cb Auto-generated baselines by 1ES Pipeline Templates (#23603)
09e5724 [CUDA] Fix beam search of num_beams > 32 (#23599)
82840f6 Implement Flash Attention 2 for webgpu EP (#23576)
a6ea57b OpenVINO EP Weights Sharing Feature (#23553)
2c2ff4a [CUDA] Fix BeamSearchTest.DummyT5WithSequenceInputIds test failure in Windows (#23596)
d981b15 [webgpu/js] Optimize resize webgpu op & fix precision issues (#23591)
328a13c Enable VCPKG in more pipelines (#23590)
6728d60 [TensorRT EP] support TensorRT 10.8-GA (#23592)
d1fb58b Quantization tool: Allow user to override calibrator's session EP (#23559)
649ced4 Enable user loading model with external data from memory buffer (#23557)
544bdd6 Fix ConvTranspose for certain attribute combinations (#23488)
8f6ddf3 Delete extra cgmanifest entries and files (#23583)
5f6a315 Enable VCPKG in CI build (#23426)
e1e3f62 Bump lintrunner from 0.12.5 to 0.12.7 (#23326)
cd8775f Fix Node JS Samples (#23581)
6b4f9c4 [WebGPU EP] Batch Norm Implementation (#23525)
1fce51b Fix all instances of 4244 and 4267 warnings in OV EP code (#23567)
c29ca1c Update QNN default version to 2.31 (#23573)
2fc75a4 [mobile] Add Android BrowserStack test project back (#23551)
9e18b6a [CUDA] Update nvcc flags (#23572)
b47e1e6 [QNN EP] Make offloading graph input/output quantization (to CPU) the default (#23368)
75a9b40 [ROCm] Update CI to use rocm 6.3.2 (#23577)
26ff2b6 Bump ruff from 0.9.3 to 0.9.4 (#23563)
b2560a7 Update react-native to 0.72 (#23509)
faee912 [js] update JavaScript API to support QNN EP options (#23486)
816e8cb [EP Perf] Update env to ubuntu 22.04 (#23570)
cddc271 Use Eigen in Round implementation (#23571)
e8b0bdb Shape inference: ReduceMean dispatcher, quant_pre_process: skip_symbolic_shape bugfix (#23558)
267b493 delete the supported domain version upper bounds (#23237)
bb7f961 remove log spam from cpuinfo (#23548)
169917b Use latest vcpkg commit in configuration, sync manifest with deps.txt (#23554)
a9d4d08 Add of ReduceMax Gradient (#23501)
6bbf1bd [js/web] upgrade version of flatbuffers (#23545)
271c509 DP4AMatMul perf refinements (#23539)
cb69c59 Add fusions for SigLIP and Conformer-Encoder (#23528)
61fae9b Remove "--enable_pybind" from webgpu pipeline (#23550)
0bb4ea6 Update BiasGelu fusion and related ops (#23518)
4dde74a Add more details to BrowserStack script failure (#23520)
ead9d5c Set ANDROID_USE_LEGACY_TOOLCHAIN_FILE to false (#23544)
7e24088 Enable dlpack by default (#23110)
dc2f7a9 Add overload of `TryParseStringWithClassicLocale()` that uses `std::from_chars()` (#23541)
5407c69 Fix the issue that the new generated EP context model not able to find external data (#23537)
fbae88f [js/web] use the recommended workaround for Vite (#23531)
d5338da Fix tensor external data info length parsing issue. (#23526)
e3e4173 [ROCm EP] Fix transpose helper for gfx gridsize constraints (#23527)
80bc1d2 Enable Ep context with external data for CPU nodes (#23498)
bf023ab [js/web] allow import .mjs/.wasm file (#23487)
655a23f [onnxruntime/build] Add new flag enable_generic_interface to build primary EPs by default (#23342)
a770a8d Update RN to 0.71.19 (#23381)
1cf0ebd Delete Prefast workflow until the build failure is fixed (#23510)
d2c5e24 Add of GlobalMaxPool Gradient (#23502)
ded8730 Remove thrust::unary_function (#23506)
8db97a6 [webgpu] Bump version of Dawn to b9b4a370 (#23494)
fdde2e2 Fix for gcc 13.3.1: Avoid creating a copy (#23500)
96ec1dd Bump ruff from 0.9.2 to 0.9.3 (#23496)
42f0c00 Adds the new System.Numerics.Tensors as an input/output type when using dotnet 8.0 and up. (#23261)
97c2bbe Fix shape infer of onnx GroupNorm (#23477)
1fc9c48 Enable coremltools for Linux build (#23481)
13348c5 [ARM CPU] hgemm optimized for gqa (#23107)
c89a798 Enable opti on Microsoft.ML.OnnxRuntime with RelWithDebInfo config (#23463)
d00ae32 Revert "[Mobile] Add BrowserStack Android MAUI Test (#23383)" (#23474)
8b1d3b3 Align AvgPool ceil_mode on last value to torch (#16752)
06fc73b [TRT EP Perf Tool] Add annotations import to python script to support annotations on Python 3.8 (#23466)
### Motivation and Context
This update includes the change to add QNN EP to ARM64X build targets.
--------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: Adrian Lizarraga <[email protected]> Co-authored-by: Ti-Tai Wang <[email protected]> Co-authored-by: Caroline Zhu <[email protected]> Co-authored-by: Grégoire <[email protected]> Co-authored-by: Jing Fang <[email protected]> Co-authored-by: Changming Sun <[email protected]> Co-authored-by: Yateng Hong <[email protected]> Co-authored-by: Michael Sharp <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Malik Shahzad Muzaffar <[email protected]> Co-authored-by: Yulong Wang <[email protected]> Co-authored-by: Dmitri Smirnov <[email protected]> Co-authored-by: Corentin Maravat <[email protected]> Co-authored-by: Jian Chen <[email protected]> Co-authored-by: Karim Vadsariya <[email protected]> Co-authored-by: Lei Cao <[email protected]> Co-authored-by: Karim Vadsariya <[email protected]> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Hector Li <[email protected]> Co-authored-by: Ted Themistokleous <[email protected]> Co-authored-by: Ted Themistokleous <[email protected]> Co-authored-by: Edward Chen <[email protected]> Co-authored-by: Takeshi Watanabe <[email protected]> Co-authored-by: Xavier Dupré <[email protected]> Co-authored-by: Justin Chu <[email protected]> Co-authored-by: Tianlei Wu <[email protected]> Co-authored-by: kunal-vaishnavi <[email protected]> Co-authored-by: Sushanth Rajasankar <[email protected]> Co-authored-by: PARK DongHa <[email protected]> Co-authored-by: George Wu <[email protected]> Co-authored-by: Xinpeng Dou <[email protected]> Co-authored-by: Jambay Kinley <[email protected]> Co-authored-by: Yifan Li <[email protected]> Co-authored-by: Gavin Kinsey <[email protected]> Co-authored-by: Prathik Rao <[email protected]> Co-authored-by: Jon Campbell <[email protected]> Co-authored-by: Satya Kumar Jandhyala <[email protected]> Co-authored-by: Joshua Lochner <[email protected]> Co-authored-by: Ankit Maheshkar <[email protected]> Co-authored-by: jatinwadhwa921 <[email protected]> Co-authored-by: jatinwadhwa921 <[email protected]> Co-authored-by: saurabh <[email protected]> Co-authored-by: TejalKhade28 <[email protected]> Co-authored-by: sfatimar <[email protected]> Co-authored-by: Javier E. Martinez <[email protected]> Co-authored-by: Preetha Veeramalai <[email protected]> Co-authored-by: Eric Crawford <[email protected]> Co-authored-by: microsoft-github-policy-service[bot] <77245923+microsoft-github-policy-service[bot]@users.noreply.github.com> Co-authored-by: Jie Chen <[email protected]> Co-authored-by: shaoboyan091 <[email protected]> Co-authored-by: David Hotham <[email protected]> Co-authored-by: Guenther Schmuelling <[email protected]> Co-authored-by: Enrico Galli <[email protected]>